Anomalies are data points that differ from other observations in some way, typically as measured against a model fit to the data. In contrast with ordinary descriptive statistics, where such points are often excluded as outliers, here we are interested in finding where these anomalous data points occur.
We assume the anomaly detection task is unsupervised, i.e. we don’t have training data with points labeled as anomalous. Each data point passed to an anomaly detection model is given a score indicating how different the point is relative to the rest of the dataset. The calculation of this score varies between models, but a higher score always indicates a point is more anomalous. Often a threshold is chosen to make a final classification of each point as typical or anomalous; this post-processing step is left to the user.
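As a minimal illustration of that post-processing step, the sketch below (plain Python, with made-up scores and an arbitrary threshold chosen only for this example) converts raw anomaly scores into binary typical/anomalous labels:
# hypothetical anomaly scores and threshold, for illustration only
scores = [0.8, 1.1, 5.7, 0.9, 3.2]
threshold = 2.0
# a score at or above the threshold is flagged as anomalous
labels = ['anomalous' if s >= threshold else 'typical' for s in scores]
print labels  # ['typical', 'typical', 'anomalous', 'typical', 'anomalous']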
The GraphLab Create (GLC) Anomaly Detection toolkit currently includes three models for two different data contexts: the Local Outlier Factor model for multivariate data, and the Moving Z-score and Bayesian Changepoints models for time-series data.
In this short note, we demonstrate how the GLC Local Outlier Factor model can be used to reveal anomalies in a multivariate data set. We will use the customer data from a recent AirBnB New User Bookings competition on Kaggle. More specifically, we have downloaded a copy of the file train_users_2.csv into our working directory. Each row in this dataset describes one of 213,451 AirBnB users; there is a mix of basic features, such as gender, age, and preferred language, as well as the user's "technology profile", including the browser type, device type, and sign-up method.
In [1]:
import graphlab as gl
from visualization_helper_functions import *
In [2]:
customer_data = gl.SFrame.read_csv('./train_users_2.csv')
In [3]:
customer_data.head(5)
Out[3]:
For the needs of our current presentation we will only need a small subset of the available basic customer features, i.e. 'gender', 'age', and 'language'.
In [4]:
features = ['gender', 'age', 'language']
customer_data = customer_data[['id']+features]
customer_data
Out[4]:
From the quick exploratory data analysis below:
In [5]:
%matplotlib inline
univariate_summary_plot(customer_data, features, nsubplots_inrow=3, subplots_wspace=0.7)
In [6]:
gl.canvas.set_target('browser')
customer_data[['age']].show()
In [7]:
print 'Number of customer records with ages of 2013 or larger: %d' %\
len(customer_data[customer_data['age'] >= 2013])
we notice that there are about 750 records having an 'age' value of '2013' or '2014', which is of course wrong. Most probably the year was recorded accidentally in this field. The remaining 'age' values seem absolutely reasonable, with only some rare customer entries having ages greater than '100'. In fact, more than 128 thousand customer entries are found to have ages in the [1, 142] interval. More specifically, we have chosen to treat any value falling in the [1, 150] interval as an eligible recording of a customer age, re-assigning all the remaining ones as missing:
In [8]:
customer_data['age'] = customer_data['age'].apply(lambda age: age if age < 150 else None)
customer_data = customer_data.dropna(columns = features, how='any')
print 'Number of Rows in dataset: %d' % len(customer_data)
Now, the univariate summary statistics of the customer_data set take the form:
In [9]:
univariate_summary_plot(customer_data, features, nsubplots_inrow=3, subplots_wspace=0.7)
and more specifically the remaining customer ages follow the distribution below:
In [10]:
# transform the SFrame into a Pandas DataFrame
customer_data_df = customer_data.to_dataframe()
customer_data_df['gender'] = customer_data_df['gender'].astype(str)
customer_data_df['age'] = customer_data_df['age'].astype(float)
customer_data_df['language'] = customer_data_df['language'].astype(str)
In [12]:
# import the plotting libraries (if not already provided by the helper module)
import matplotlib.pyplot as plt
import seaborn as sns

# define seaborn style, palette, color codes
sns.set(style="whitegrid", palette="deep", color_codes=True)
# initialize the matplotlib figure
plt.figure(figsize=(12,7))
# draw distplot
ax1 = sns.distplot(customer_data_df.age, bins=None, hist=True, kde=False, rug=False, color='b')
If we would like to explore the countplot for the variable language in more detail, we can temporarily exclude the English-speaking customers and redraw the graph:
In [13]:
# exclude the english-speaking customers
customer_data_df_nen = customer_data_df[customer_data_df['language']!='en']
# define seaborn style, palette, color codes
sns.set(style="whitegrid", palette="deep",color_codes=False)
# initialize the matplotlib figure
plt.figure(figsize=(7,11))
plt.ylabel('language', {'fontweight': 'bold'})
plt.title('Countplot of Customer Languages\n[English-speaking people excluded]',
{'fontweight': 'bold'})
# draw countplot
ax2 = sns.countplot(y='language', data=customer_data_df_nen, palette='deep', color='b')
The univariate summary statistics plot for this new customer_data_df_nen set is as follows.
In [14]:
univariate_summary_plot(customer_data_df_nen, features, subplots_wspace=0.7)
The data set of interest, customer_data, has two nominal categorical variables:
- 'gender': a nominal categorical attribute with the levels FEMALE/MALE/unknown/OTHER
- 'language': a nominal categorical attribute with 25 different languages,
which we should encode before applying any learning algorithm. To do so, we will apply the OneHotEncoder transformation as shown below:
In [15]:
one_hot_encoder = gl.toolkits.feature_engineering.OneHotEncoder(features=['gender', 'language'])
customer_data1 = one_hot_encoder.fit_transform(customer_data)
Local Outlier Factor (LOF) models are distance-based learning algorithms. Therefore, we need to standardize the 'age' feature so that it is on roughly the same scale as the encoded categorical variables.
In [16]:
customer_data1['age'] = (customer_data['age'] - customer_data['age'].mean())/\
customer_data['age'].std()
customer_data1
Out[16]:
Next, we train the LOF model using this transformed customer_data1 set.
In [17]:
model_lof = gl.anomaly_detection.local_outlier_factor.create(customer_data1,
features = ['age', 'encoded_features'],
threshold_distances=True,
verbose=False)
In [18]:
model_lof.save('./model_lof')
model_lof = gl.load_model('./model_lof/')
In [19]:
print 'The LOF model has been trained with the following options:'
print '-------------------------------------------------------------'
print model_lof.get_current_options()
Note that the model can automatically choose a suitable distance metric for the data types of the available features. Here, a composite distance has been chosen: a 'jaccard' metric for the 'encoded_features' column and a 'euclidean' metric for the 'age' column, with both metrics weighted by 1.0.
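If we prefer to set this composite distance explicitly rather than rely on the automatic choice, a minimal sketch is shown below. It assumes the create function accepts a composite distance through its distance argument, specified as a list of [feature names, metric name, weight] triples (as in the GLC distances documentation); this is an illustration, not the call used above.
# hypothetical explicit composite distance: each entry is
# [list of feature names, distance metric name, relative weight]
composite_distance = [[['encoded_features'], 'jaccard', 1.0],
                      [['age'], 'euclidean', 1.0]]
model_lof_explicit = gl.anomaly_detection.local_outlier_factor.create(
    customer_data1,
    distance=composite_distance,
    threshold_distances=True,
    verbose=False)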
If we want to see what the model has built internally, we can simply write:
In [20]:
print model_lof
More importantly, here is the SFrame with the LOF anomaly scores:
In [21]:
model_lof['scores']
Out[21]:
Firstly, note that the model worked successfully, scoring each of the 124,681 input rows. Secondly, the anomaly score for many observations in our AirBnB dataset is nan, which indicates the point has many neighbors at exactly the same location, making the ratio of densities undefined. These points cannot be outliers.
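If we want to quantify how common this is, one quick, illustrative way to count the nan scores is to exploit the fact that nan is the only value not equal to itself:
# count the anomaly scores that are undefined (s != s is True only for nan)
lof_scores = model_lof['scores']['anomaly_score']
num_nan_scores = lof_scores.apply(lambda s: 1 if s != s else 0).sum()
print 'Number of nan anomaly scores: %d' % num_nan_scores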
However, for the problem at hand we are interested in whether any true outliers exist and under what circumstances they occur; this is where the real business value lies. There are two common ways to isolate them:
A. Rank the observations by their 'anomaly_score' and inspect the top-k most anomalous points:
In [22]:
top10_anomalies = model_lof['scores'].topk('anomaly_score', k=10)
top10_anomalies.print_rows(num_rows=10)
Note that the anomaly scores for these points are infinite, which happens when a point is next to several identical points, but is not itself a member of that bunch. These points are certainly anomalous, but our specific choice of k was arbitrary and excluded many points that are also likely anomalous.
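Before moving on to option B, one natural follow-up is to check how sensitive this ranking is to the neighborhood size. The sketch below assumes the create function exposes that choice through a num_neighbors argument (as in the GLC documentation); the value 10 is arbitrary and only for illustration.
# re-train the LOF model with a larger (hypothetical) neighborhood size
# and compare the new top-10 anomalies against the previous ranking
model_lof_k10 = gl.anomaly_detection.local_outlier_factor.create(
    customer_data1,
    features=['age', 'encoded_features'],
    num_neighbors=10,
    threshold_distances=True,
    verbose=False)
model_lof_k10['scores'].topk('anomaly_score', k=10).print_rows(num_rows=10)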
B. Choose a threshold, either from domain knowledge or scientific expertise, in order to find the anomalous observations in your data set: observations with an 'anomaly_score' greater than this threshold will be the anomalous ones.
Of course, a closer look at the distribution of the anomaly_scores may help us a lot with this decision.
In [23]:
anomaly_scores_sketch = model_lof['scores']['anomaly_score'].sketch_summary()
print anomaly_scores_sketch
In [24]:
threshold = anomaly_scores_sketch.quantile(0.9)
anomalies_mask = model_lof['scores']['anomaly_score'] >= threshold
anomalies = model_lof['scores'][anomalies_mask]
print 'Threshold: %.5f' % threshold, '\nNumber of Anomalies: %d' % len(anomalies)
In [25]:
anomalies.print_rows(num_rows=10)
Finally, we can filter the customer_data set by anomalies['row_id'] to obtain the original features of these anomalous records.
In [26]:
customer_data = customer_data.add_row_number(column_name='row_id')
anomalous_customer_data = customer_data.filter_by(anomalies['row_id'], 'row_id')
anomalous_customer_data.print_rows(num_rows=200)
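To get a feel for which profiles are being flagged, one further optional step is to summarize the anomalous records by their categorical attributes; the sketch below simply counts the flagged customers per gender/language combination:
# count the flagged customers for each gender/language combination
anomaly_profile = anomalous_customer_data.groupby(
    ['gender', 'language'],
    {'num_anomalous_customers': gl.aggregate.COUNT})
anomaly_profile.sort('num_anomalous_customers', ascending=False).print_rows(num_rows=20)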